This file is a subset of the code in Coronavirus_Statistics_v002.Rmd and contains the latest code for analyzing data from The COVID Tracking Project. The COVID Tracking Project publishes data on positive tests, hospitalizations, deaths, and related metrics for coronavirus in the US. Downloaded data are unique by state and date.
Companion code for functions is in Coronavirus_Statistics_Functions_CTP_v003.R and Coronavirus_Statistics_Functions_Shared_v003.R. The code leverages tidyverse and a named vector for variable mapping throughout:
# All functions assume that tidyverse and its components are loaded and available
# Other functions are declared in the sourcing files or use library::function()
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# If the same function is defined in both files, the version sourced last (the more specific CTP file) is used
source("./Coronavirus_Statistics_Functions_Shared_v003.R")
source("./Coronavirus_Statistics_Functions_CTP_v003.R")
# Create a variable mapping file
varMapper <- c("cases"="Cases",
               "newCases"="Increase in cases, most recent 30 days",
               "casesroll7"="Rolling 7-day mean cases",
               "deaths"="Deaths",
               "newDeaths"="Increase in deaths, most recent 30 days",
               "deathsroll7"="Rolling 7-day mean deaths",
               "cpm"="Cases per million",
               "cpm7"="Cases per day (7-day rolling mean) per million",
               "newcpm"="Increase in cases, most recent 30 days, per million",
               "dpm"="Deaths per million",
               "dpm7"="Deaths per day (7-day rolling mean) per million",
               "newdpm"="Increase in deaths, most recent 30 days, per million",
               "hpm7"="Currently Hospitalized per million (7-day rolling mean)",
               "tpm"="Tests per million",
               "tpm7"="Tests per million per day (7-day rolling mean)"
               )
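As a minimal sketch of how a mapping vector like this might be used, the snippet below looks up display labels by variable name and falls back to the raw name when no label exists. The helper labelFor() and the shortened exampleMapper are hypothetical names introduced only for illustration; the companion files may handle labels differently.

```r
# Shortened copy of the mapping vector, for illustration only
exampleMapper <- c("cpm7"="Cases per day (7-day rolling mean) per million",
                   "dpm7"="Deaths per day (7-day rolling mean) per million"
                   )

# Hypothetical helper: return the label for each key, or the key itself if unmapped
labelFor <- function(keys, mapper) {
    labels <- unname(mapper[keys])
    ifelse(is.na(labels), keys, labels)
}

# The second key is deliberately absent from the mapper
labelFor(c("cpm7", "notMapped"), exampleMapper)
```

A lookup of this kind keeps plot titles and axis labels consistent without hard-coding text in each plotting function.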
The main function is readRunCOVIDTrackingProject(), which performs multiple tasks:
STEP 1: Extracts a file of population by state (by default uses 2015 population from usmap::statepop)
STEP 2a^: Downloads the latest data from COVID Tracking Project if requested
STEP 2b^: Reads in data from a specified local file (may have just been downloaded in step 2a), and checks control total trends against a previous version of the file
STEP 3^: Processes the loaded data file to keep the proper variables, drop non-valid states, etc.
STEP 4^: Adds per-capita metrics for cases, deaths, tests, and hospitalizations
STEP 5: Adds existing clusters by state if passed as an argument to useClusters=, otherwise creates new segments based on user-defined parameters
STEP 6^^: Creates assessment plots for the state-level clusters
STEP 7^^: Creates consolidated plots of cases, hospitalizations, deaths, and tests
STEP 8^^: Optionally, creates plots of cumulative burden by segments and by state
STEP 9: Returns a list of key data frames, modeling objects, named cluster vectors, etc.
^ The user can instead specify a previously processed file and skip steps 2a, 2b, 3, and 4. The previously processed file must already be formatted and filtered so that it can be used “as is”.
^^ The user can skip the segment-level assessments by setting skipAssessmentPlots=TRUE
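Step 4 can be illustrated with a small base R sketch. The per-million scaling follows from the metric names. The centered 7-day window is an assumption, chosen because it is consistent with the cpm7 and tpm7 columns in the output shown later in this document (the first three dates for each state are NA). The function names perMillion() and roll7(), the population, and the daily counts are all hypothetical.

```r
# Toy daily test counts for one hypothetical state (illustrative values only)
tests <- c(0, 1, 0, 0, 0, 1, 0, 1)
pop <- 2e6  # hypothetical population

# Scale a raw count to a per-million rate
perMillion <- function(x, pop) x * 1e6 / pop

# Centered 7-day rolling mean; stats::filter is called explicitly because dplyr masks filter()
# Assumption: the window is centered, so the first and last three days are NA
roll7 <- function(x) as.numeric(stats::filter(x, rep(1/7, 7), sides=2))

tpm <- perMillion(tests, pop)
tpm7 <- roll7(tpm)
data.frame(tests, tpm, tpm7)
```

With a centered window, only days 4 and 5 of this 8-day series have a full week of data around them, so the remaining values of tpm7 are NA.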
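Broadly, there are several use cases for the function:
Use case 1: Run the full process (download data, create segments, and assess performance)
Use case 2: Gather new data and assess existing state-level clusters
Use case 3: Assess a different clustering approach using previously processed data
Use case 4: Combine previously processed data with existing clusters, skipping the assessment plots
An example of each use case follows, with the caveat that data are not repeatedly downloaded (the process is cached) to avoid unnecessary calls to the COVID Tracking Project server.
Further, results can be saved in RDS format so they can be loaded and used later.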
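The saveToRDS() and readFromRDS() helpers used in this file live in the companion function files, and their internals are not shown here. As a rough sketch of the underlying idea only, base R's saveRDS() and readRDS() can round-trip a result list; the toy list and the temporary file path below are hypothetical.

```r
# Toy result list standing in for the value returned by the main function
result <- list(useClusters=c("NH"=1L, "VT"=1L, "NY"=2L))

# Write to a temporary file, then read it back
rdsPath <- tempfile(fileext=".RDS")
saveRDS(result, file=rdsPath)
restored <- readRDS(rdsPath)

# The restored object matches the original, including the names on the cluster vector
identical(result, restored)
```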
The full process downloads data, creates segments, and assesses performance. Hierarchical segmentation with a heavy focus on deaths vs. cases tends to work well for creating state-level clusters:
# Create segments and download data from COVID Tracking Project
# Create 6 segments but place Vermont (a very small state and dendrogram outlier) in the New Hampshire segment
locDownload <- "./RInputFiles/Coronavirus/CV_downloaded_201025.csv"
test_hier5_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020",
                                                 downloadTo=if (file.exists(locDownload)) NULL else locDownload,
                                                 readFrom=locDownload,
                                                 compareFile=readFromRDS("test_hier5_201001")$dfRaw,
                                                 hierarchical=TRUE,
                                                 reAssignState=list("VT"="NH"),
                                                 kCut=6,
                                                 minShape=3,
                                                 ratioDeathvsCase=5,
                                                 ratioTotalvsShape=0.5,
                                                 minDeath=100,
                                                 minCase=10000
                                                 )
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## state = col_character(),
## totalTestResultsSource = col_character(),
## dataQualityGrade = col_character(),
## lastUpdateEt = col_character(),
## dateModified = col_datetime(format = ""),
## checkTimeEt = col_character(),
## dateChecked = col_datetime(format = ""),
## fips = col_character(),
## hash = col_character(),
## grade = col_logical()
## )
## i Use `spec()` for the full column specifications.
##
## File is unique by state and date
##
##
## Overall control totals in file:
## # A tibble: 1 x 3
## positiveIncrease deathIncrease hospitalizedCurrently
## <dbl> <dbl> <dbl>
## 1 8531788 216646 8686442
##
## *** COMPARISONS TO REFERENCE FILE: compareFile
##
## Checkin for similarity of: column names
## In reference but not in current:
## In current but not in reference: probableCases
##
## Checkin for similarity of: states
## In reference but not in current:
## In current but not in reference:
##
## Checkin for similarity of: dates
## In reference but not in current:
## In current but not in reference: 2020-10-24 2020-10-23 2020-10-22 2020-10-21 2020-10-20 2020-10-19 2020-10-18 2020-10-17 2020-10-16 2020-10-15 2020-10-14 2020-10-13 2020-10-12 2020-10-11 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06 2020-10-05 2020-10-04 2020-10-03 2020-10-02 2020-10-01
##
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("date", "name")
## date name newValue oldValue
## 1 2020-03-28 positiveIncrease 19925 19692
## 2 2020-03-28 deathIncrease 544 538
## 3 2020-03-29 positiveIncrease 19348 19581
## 4 2020-03-29 deathIncrease 515 521
## Joining, by = c("date", "name")
## Warning: Removed 24 row(s) containing missing values (geom_path).
##
##
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("state", "name")
## state name newValue oldValue
## 1 HI positiveIncrease 12469 12289
## Rows: 13,157
## Columns: 55
## $ date <date> 2020-10-24, 2020-10-24, 2020-10-24, 20...
## $ state <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive <dbl> 13535, 183276, 105318, 0, 236772, 89281...
## $ probableCases <dbl> NA, 26330, 7105, NA, 5417, NA, 6492, 26...
## $ negative <dbl> 539585, 1138922, 1181805, 1616, 1462194...
## $ pending <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestResultsSource <chr> "totalTestsViral", "totalTestsViral", "...
## $ totalTestResults <dbl> 552746, 1295868, 1280018, 1616, 1693549...
## $ hospitalizedCurrently <dbl> 58, 920, 606, NA, 819, 3007, 550, 233, ...
## $ hospitalizedCumulative <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ inIcuCurrently <dbl> NA, NA, 242, NA, 191, 744, NA, NA, 21, ...
## $ inIcuCumulative <dbl> NA, 2021, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently <dbl> 8, NA, 94, NA, 87, NA, NA, NA, 6, NA, N...
## $ onVentilatorCumulative <dbl> NA, 1157, 808, NA, NA, NA, NA, NA, NA, ...
## $ recovered <dbl> 6939, 74439, 93977, NA, 39525, NA, 7463...
## $ dataQualityGrade <chr> "A", "A", "A+", "D", "A+", "B", "A", "B...
## $ lastUpdateEt <chr> "10/24/2020 03:59", "10/24/2020 11:00",...
## $ dateModified <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ checkTimeEt <chr> "10/23 23:59", "10/24 07:00", "10/23 20...
## $ death <dbl> 68, 2866, 1797, 0, 5869, 17311, 2076, 4...
## $ hospitalized <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ dateChecked <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ totalTestsViral <dbl> 552746, 1295868, 1280018, 1616, NA, 176...
## $ positiveTestsViral <dbl> 11644, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ negativeTestsViral <dbl> 540786, NA, 1181805, NA, NA, NA, NA, NA...
## $ positiveCasesViral <dbl> 13535, 156946, 98213, 0, 231355, 892810...
## $ deathConfirmed <dbl> 68, 2680, 1640, NA, 5581, NA, NA, 3674,...
## $ deathProbable <dbl> NA, 186, 157, NA, 288, NA, NA, 903, NA,...
## $ totalTestEncountersViral <dbl> NA, NA, NA, NA, NA, NA, 1790404, NA, 49...
## $ totalTestsPeopleViral <dbl> NA, NA, NA, NA, 1693549, NA, 1124409, N...
## $ totalTestsAntibody <dbl> NA, NA, NA, NA, 312232, NA, 179232, NA,...
## $ positiveTestsAntibody <dbl> NA, NA, NA, NA, NA, NA, 12741, NA, NA, ...
## $ negativeTestsAntibody <dbl> NA, NA, NA, NA, NA, NA, 166491, NA, NA,...
## $ totalTestsPeopleAntibody <dbl> NA, 63359, NA, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen <dbl> NA, NA, 46505, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntigen <dbl> NA, NA, 7891, NA, NA, NA, NA, NA, NA, N...
## $ totalTestsAntigen <dbl> NA, NA, 21856, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen <dbl> NA, NA, 3300, NA, NA, NA, NA, NA, NA, N...
## $ fips <chr> "02", "01", "05", "60", "04", "06", "08...
## $ positiveIncrease <dbl> 374, 2360, 1183, 0, 890, 5945, 1350, 0,...
## $ negativeIncrease <dbl> 0, 5064, 11643, 0, 11213, 119941, 9351,...
## $ total <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ totalTestResultsIncrease <dbl> 0, 6095, 12517, 0, 12080, 125886, 25756...
## $ posNeg <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ deathIncrease <dbl> 0, 7, 15, 0, 4, 49, 6, 0, 0, 2, 76, 42,...
## $ hospitalizedIncrease <dbl> 0, 0, 29, 0, 76, 0, 79, 0, 0, 0, 174, 9...
## $ hash <chr> "280ee400bd797c20b77218c9e54a0b6615f91a...
## $ commercialScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
##
##
## Control totals - note that validState other than TRUE will be discarded
##
## # A tibble: 2 x 6
## validState cases deaths hosp tests n
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 66880 888 NA 471313 1115
## 2 TRUE 8464908 215758 NA 130976369 12042
## Rows: 12,042
## Columns: 6
## $ date <date> 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24,...
## $ state <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", ...
## $ cases <dbl> 374, 2360, 1183, 890, 5945, 1350, 0, 97, 160, 4471, 1846, 14...
## $ deaths <dbl> 0, 7, 15, 4, 49, 6, 0, 0, 2, 76, 42, 3, 11, 9, 63, 26, 0, 8,...
## $ hosp <dbl> 58, 920, 606, 819, 3007, 550, 233, 93, 103, 2162, 1684, 71, ...
## $ tests <dbl> 0, 6095, 12517, 12080, 125886, 25756, 0, 5800, 2164, 72309, ...
## Rows: 12,042
## Columns: 14
## $ date <date> 2020-01-22, 2020-01-22, 2020-01-23, 2020-01-23, 2020-01-24,...
## $ state <chr> "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", ...
## $ cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hosp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tests <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ cpm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm <dbl> 0.0000000, 0.0000000, 0.1471796, 0.0000000, 0.0000000, 0.000...
## $ cpm7 <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm7 <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm7 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm7 <dbl> NA, NA, NA, NA, NA, NA, 0.04205130, 0.00000000, 0.06307695, ...
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'date', 'cluster' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
##
## Recency is defined as 2020-09-25 through current
##
## Recency is defined as 2020-09-25 through current
## `summarise()` regrouping output by 'state', 'cluster', 'date' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
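The hierarchical=TRUE, kCut=6, and reAssignState=list("VT"="NH") arguments in the call above suggest a cut dendrogram followed by a manual reassignment. The sketch below shows that pattern with base R's hclust() and cutree(). The feature matrix, the Euclidean distance, and the complete linkage are assumptions made for illustration, and the actual function may build its features and distances differently.

```r
# Toy feature matrix: rows are states, columns are hypothetical shape features
set.seed(42)
states <- c("NH", "VT", "NY", "NJ", "TX", "FL", "CA", "WA")
toyFeat <- matrix(runif(length(states) * 4),
                  nrow=length(states),
                  dimnames=list(states, c("cases_4", "deaths_4", "cases_7", "deaths_7"))
                  )

# Hierarchical clustering on Euclidean distances, cut into 6 groups
hc <- hclust(dist(toyFeat), method="complete")
clusters <- cutree(hc, k=6)

# Manual reassignment: each named state takes the cluster of the state it maps to
reAssignState <- list("VT"="NH")
for (st in names(reAssignState)) {
    clusters[st] <- clusters[[reAssignState[[st]]]]
}
clusters
```

Reassigning an outlier after the cut, rather than dropping it, keeps every state in the named cluster vector while avoiding a segment that contains a single small state.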
A modified process gathers new data and assesses existing state-level clusters:
# Use existing segments with updated data
locDownload <- "./RInputFiles/Coronavirus/CV_downloaded_201025.csv"
test_old_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020",
                                               downloadTo=if (file.exists(locDownload)) NULL else locDownload,
                                               readFrom=locDownload,
                                               compareFile=readFromRDS("test_hier5_201001")$dfRaw,
                                               useClusters=readFromRDS("test_hier5_201001")$useClusters
                                               )
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## state = col_character(),
## totalTestResultsSource = col_character(),
## dataQualityGrade = col_character(),
## lastUpdateEt = col_character(),
## dateModified = col_datetime(format = ""),
## checkTimeEt = col_character(),
## dateChecked = col_datetime(format = ""),
## fips = col_character(),
## hash = col_character(),
## grade = col_logical()
## )
## i Use `spec()` for the full column specifications.
##
## File is unique by state and date
##
##
## Overall control totals in file:
## # A tibble: 1 x 3
## positiveIncrease deathIncrease hospitalizedCurrently
## <dbl> <dbl> <dbl>
## 1 8531788 216646 8686442
##
## *** COMPARISONS TO REFERENCE FILE: compareFile
##
## Checkin for similarity of: column names
## In reference but not in current:
## In current but not in reference: probableCases
##
## Checkin for similarity of: states
## In reference but not in current:
## In current but not in reference:
##
## Checkin for similarity of: dates
## In reference but not in current:
## In current but not in reference: 2020-10-24 2020-10-23 2020-10-22 2020-10-21 2020-10-20 2020-10-19 2020-10-18 2020-10-17 2020-10-16 2020-10-15 2020-10-14 2020-10-13 2020-10-12 2020-10-11 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06 2020-10-05 2020-10-04 2020-10-03 2020-10-02 2020-10-01
##
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("date", "name")
## date name newValue oldValue
## 1 2020-03-28 positiveIncrease 19925 19692
## 2 2020-03-28 deathIncrease 544 538
## 3 2020-03-29 positiveIncrease 19348 19581
## 4 2020-03-29 deathIncrease 515 521
## Joining, by = c("date", "name")
## Warning: Removed 24 row(s) containing missing values (geom_path).
##
##
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("state", "name")
## state name newValue oldValue
## 1 HI positiveIncrease 12469 12289
## Rows: 13,157
## Columns: 55
## $ date <date> 2020-10-24, 2020-10-24, 2020-10-24, 20...
## $ state <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive <dbl> 13535, 183276, 105318, 0, 236772, 89281...
## $ probableCases <dbl> NA, 26330, 7105, NA, 5417, NA, 6492, 26...
## $ negative <dbl> 539585, 1138922, 1181805, 1616, 1462194...
## $ pending <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestResultsSource <chr> "totalTestsViral", "totalTestsViral", "...
## $ totalTestResults <dbl> 552746, 1295868, 1280018, 1616, 1693549...
## $ hospitalizedCurrently <dbl> 58, 920, 606, NA, 819, 3007, 550, 233, ...
## $ hospitalizedCumulative <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ inIcuCurrently <dbl> NA, NA, 242, NA, 191, 744, NA, NA, 21, ...
## $ inIcuCumulative <dbl> NA, 2021, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently <dbl> 8, NA, 94, NA, 87, NA, NA, NA, 6, NA, N...
## $ onVentilatorCumulative <dbl> NA, 1157, 808, NA, NA, NA, NA, NA, NA, ...
## $ recovered <dbl> 6939, 74439, 93977, NA, 39525, NA, 7463...
## $ dataQualityGrade <chr> "A", "A", "A+", "D", "A+", "B", "A", "B...
## $ lastUpdateEt <chr> "10/24/2020 03:59", "10/24/2020 11:00",...
## $ dateModified <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ checkTimeEt <chr> "10/23 23:59", "10/24 07:00", "10/23 20...
## $ death <dbl> 68, 2866, 1797, 0, 5869, 17311, 2076, 4...
## $ hospitalized <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ dateChecked <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ totalTestsViral <dbl> 552746, 1295868, 1280018, 1616, NA, 176...
## $ positiveTestsViral <dbl> 11644, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ negativeTestsViral <dbl> 540786, NA, 1181805, NA, NA, NA, NA, NA...
## $ positiveCasesViral <dbl> 13535, 156946, 98213, 0, 231355, 892810...
## $ deathConfirmed <dbl> 68, 2680, 1640, NA, 5581, NA, NA, 3674,...
## $ deathProbable <dbl> NA, 186, 157, NA, 288, NA, NA, 903, NA,...
## $ totalTestEncountersViral <dbl> NA, NA, NA, NA, NA, NA, 1790404, NA, 49...
## $ totalTestsPeopleViral <dbl> NA, NA, NA, NA, 1693549, NA, 1124409, N...
## $ totalTestsAntibody <dbl> NA, NA, NA, NA, 312232, NA, 179232, NA,...
## $ positiveTestsAntibody <dbl> NA, NA, NA, NA, NA, NA, 12741, NA, NA, ...
## $ negativeTestsAntibody <dbl> NA, NA, NA, NA, NA, NA, 166491, NA, NA,...
## $ totalTestsPeopleAntibody <dbl> NA, 63359, NA, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen <dbl> NA, NA, 46505, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntigen <dbl> NA, NA, 7891, NA, NA, NA, NA, NA, NA, N...
## $ totalTestsAntigen <dbl> NA, NA, 21856, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen <dbl> NA, NA, 3300, NA, NA, NA, NA, NA, NA, N...
## $ fips <chr> "02", "01", "05", "60", "04", "06", "08...
## $ positiveIncrease <dbl> 374, 2360, 1183, 0, 890, 5945, 1350, 0,...
## $ negativeIncrease <dbl> 0, 5064, 11643, 0, 11213, 119941, 9351,...
## $ total <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ totalTestResultsIncrease <dbl> 0, 6095, 12517, 0, 12080, 125886, 25756...
## $ posNeg <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ deathIncrease <dbl> 0, 7, 15, 0, 4, 49, 6, 0, 0, 2, 76, 42,...
## $ hospitalizedIncrease <dbl> 0, 0, 29, 0, 76, 0, 79, 0, 0, 0, 174, 9...
## $ hash <chr> "280ee400bd797c20b77218c9e54a0b6615f91a...
## $ commercialScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
##
##
## Control totals - note that validState other than TRUE will be discarded
##
## # A tibble: 2 x 6
## validState cases deaths hosp tests n
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 66880 888 NA 471313 1115
## 2 TRUE 8464908 215758 NA 130976369 12042
## Rows: 12,042
## Columns: 6
## $ date <date> 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24,...
## $ state <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", ...
## $ cases <dbl> 374, 2360, 1183, 890, 5945, 1350, 0, 97, 160, 4471, 1846, 14...
## $ deaths <dbl> 0, 7, 15, 4, 49, 6, 0, 0, 2, 76, 42, 3, 11, 9, 63, 26, 0, 8,...
## $ hosp <dbl> 58, 920, 606, 819, 3007, 550, 233, 93, 103, 2162, 1684, 71, ...
## $ tests <dbl> 0, 6095, 12517, 12080, 125886, 25756, 0, 5800, 2164, 72309, ...
## Rows: 12,042
## Columns: 14
## $ date <date> 2020-01-22, 2020-01-22, 2020-01-23, 2020-01-23, 2020-01-24,...
## $ state <chr> "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", ...
## $ cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hosp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tests <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ cpm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm <dbl> 0.0000000, 0.0000000, 0.1471796, 0.0000000, 0.0000000, 0.000...
## $ cpm7 <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm7 <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm7 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm7 <dbl> NA, NA, NA, NA, NA, NA, 0.04205130, 0.00000000, 0.06307695, ...
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'date', 'cluster' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
##
## Recency is defined as 2020-09-25 through current
##
## Recency is defined as 2020-09-25 through current
## `summarise()` regrouping output by 'state', 'cluster', 'date' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
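Both runs above report control-total differences against the reference file that are "at least 5" and "at least 1%". A minimal sketch of such a check follows. The helper flagDiff() is hypothetical, and measuring the percentage against the old (reference) value is an assumption; the first four rows reuse values printed in the log, while the last row is made up to show a difference that is not flagged.

```r
# First four rows are taken from the log above; the last row is hypothetical
dfCompare <- data.frame(name=c("positiveIncrease", "deathIncrease",
                               "positiveIncrease", "deathIncrease", "hypothetical"
                               ),
                        newValue=c(19925, 544, 19348, 515, 1003),
                        oldValue=c(19692, 538, 19581, 521, 1000),
                        stringsAsFactors=FALSE
                        )

# Hypothetical helper: flag rows that meet both the absolute and the relative threshold
flagDiff <- function(newValue, oldValue, minAbs=5, minPct=0.01) {
    absDiff <- abs(newValue - oldValue)
    # Assumption: the relative difference is measured against the old (reference) value
    (absDiff >= minAbs) & (absDiff >= minPct * abs(oldValue))
}

dfCompare$flagged <- flagDiff(dfCompare$newValue, dfCompare$oldValue)
dfCompare
```

Requiring both thresholds filters out small absolute changes in low-count series as well as large absolute changes that are proportionally trivial.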
A different clustering approach can be assessed using existing data. A common example would be exploring kmeans clustering with the previously processed state-level data:
# Test function for k-means clustering using the per capita data file previously created
test_km5_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020",
                                               dfPerCapita=test_hier5_201025$dfPerCapita,
                                               hierarchical=FALSE,
                                               minShape=3,
                                               ratioDeathvsCase=5,
                                               ratioTotalvsShape=0.5,
                                               minDeath=100,
                                               minCase=10000,
                                               nCenters=5,
                                               testCenters=1:10,
                                               iter.max=20,
                                               nstart=10,
                                               seed=2008261400
                                               )
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
##
## Cluster means and counts
## 1 2 3 4 5
## . 9.00 8.00 5.00 20.00 9.00
## totalCases 1.17 0.78 0.86 0.91 0.47
## totalDeaths 3.34 3.50 6.32 1.41 1.49
## cases_3 0.01 0.03 0.07 0.01 0.03
## deaths_3 0.04 0.09 0.15 0.05 0.23
## cases_4 0.04 0.18 0.35 0.04 0.09
## deaths_4 0.38 1.50 2.27 0.49 1.22
## cases_5 0.05 0.18 0.17 0.05 0.11
## deaths_5 0.46 1.62 1.37 0.50 1.15
## cases_6 0.13 0.07 0.07 0.07 0.09
## deaths_6 0.40 0.63 0.41 0.38 0.67
## cases_7 0.31 0.12 0.10 0.17 0.14
## deaths_7 1.03 0.33 0.26 0.58 0.49
## cases_8 0.20 0.13 0.08 0.18 0.11
## deaths_8 1.26 0.25 0.25 0.83 0.45
## cases_9 0.14 0.13 0.06 0.21 0.11
## deaths_9 0.89 0.27 0.16 0.99 0.41
## cases_10 0.13 0.17 0.10 0.28 0.17
## deaths_10 0.55 0.31 0.13 1.16 0.34
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'date', 'cluster' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
##
## Recency is defined as 2020-09-25 through current
##
## Recency is defined as 2020-09-25 through current
## `summarise()` regrouping output by 'state', 'cluster', 'date' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
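The nCenters, testCenters, iter.max, nstart, and seed arguments in the call above map naturally onto stats::kmeans(). The sketch below is an assumption about that mapping rather than the function's actual implementation. It fits 5 clusters on a toy matrix and then records the total within-cluster sum of squares for 1 through 10 centers; the wrapper runKM() and the random data are hypothetical.

```r
# Toy feature matrix: 51 hypothetical "states" by 4 features
set.seed(2008261400)
toyFeat <- matrix(rnorm(51 * 4), nrow=51)

# Hypothetical wrapper: reset the seed before each fit so results are reproducible
runKM <- function(x, k, seed) {
    set.seed(seed)
    kmeans(x, centers=k, iter.max=20, nstart=10)
}

# Fit the requested number of centers
fit <- runKM(toyFeat, k=5, seed=2008261400)
table(fit$cluster)

# Total within-cluster sum of squares for each candidate number of centers
testCenters <- 1:10
wss <- sapply(testCenters, function(k) runKM(toyFeat, k=k, seed=2008261400)$tot.withinss)
data.frame(centers=testCenters, wss=wss)
```

On unstructured data such as this toy matrix, the within-cluster sum of squares declines smoothly with no clear elbow, which is the kind of pattern that makes it hard to pick a number of segments.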
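The silhouette plot suggests that k-means may not be an ideal approach, or at least that there is no obviously optimal number of segments.
Previously processed data and existing clusters can also be combined while skipping the assessment plots: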
# Use the per capita data created above with existing clusters, skipping the assessment plots
combine_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020",
                                              dfPerCapita=test_hier5_201025$dfPerCapita,
                                              useClusters=readFromRDS("test_hier5_201001")$useClusters,
                                              skipAssessmentPlots=TRUE
                                              )
str(combine_201025)
## List of 8
## $ stateData : tibble [51 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ state: chr [1:51] "AL" "AK" "AZ" "AR" ...
## ..$ name : chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## ..$ pop : num [1:51] 4858979 738432 6828065 2978204 39144818 ...
## $ dfRaw : NULL
## $ dfFiltered : NULL
## $ dfPerCapita : tibble [12,042 x 14] (S3: tbl_df/tbl/data.frame)
## ..$ date : Date[1:12042], format: "2020-01-22" "2020-01-22" ...
## ..$ state : chr [1:12042] "MA" "WA" "MA" "WA" ...
## ..$ cases : num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ deaths: num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ hosp : num [1:12042] NA NA NA NA NA NA NA NA NA NA ...
## ..$ tests : num [1:12042] 0 0 1 0 0 0 0 0 0 0 ...
## ..$ cpm : num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ dpm : num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ hpm : num [1:12042] NA NA NA NA NA NA NA NA NA NA ...
## ..$ tpm : num [1:12042] 0 0 0.147 0 0 ...
## ..$ cpm7 : num [1:12042] NA NA NA NA NA NA 0 0 0 0 ...
## ..$ dpm7 : num [1:12042] NA NA NA NA NA NA 0 0 0 0 ...
## ..$ hpm7 : num [1:12042] NA NA NA NA NA NA NA NA NA NA ...
## ..$ tpm7 : num [1:12042] NA NA NA NA NA ...
## $ useClusters : Named int [1:51] 1 2 1 2 1 3 4 5 5 2 ...
## ..- attr(*, "names")= chr [1:51] "AK" "AL" "AR" "AZ" ...
## $ plotData : NULL
## $ consolidatedPlotData: NULL
## $ clCum : NULL
The list is properly formatted, though dfRaw, dfFiltered, and the plotting and cumulative components (plotData, consolidatedPlotData, clCum) are NULL, so it can be used by other functions that rely on the data being available in this format.
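One way downstream code might use the named useClusters vector is to look up each row's cluster by state abbreviation. The sketch below uses toy values and base R only; the data frame dfState and its counts are hypothetical, and other functions in the companion files may join clusters differently.

```r
# Toy named cluster vector and per-state data (illustrative values only)
useClusters <- c("AK"=1L, "AL"=2L, "AR"=1L)
dfState <- data.frame(state=c("AL", "AK", "AR", "AL"),
                      cases=c(10, 20, 30, 45),
                      stringsAsFactors=FALSE
                      )

# Look up each row's cluster by state abbreviation
dfState$cluster <- unname(useClusters[dfState$state])

# Total cases by cluster
clusterTotals <- aggregate(cases ~ cluster, data=dfState, FUN=sum)
clusterTotals
```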
# Save each result in RDS format so it can be loaded and used later
saveToRDS(test_hier5_201025, ovrWriteError=FALSE)
saveToRDS(test_old_201025, ovrWriteError=FALSE)
saveToRDS(test_km5_201025, ovrWriteError=FALSE)
saveToRDS(combine_201025, ovrWriteError=FALSE)